The mediocre teacher tells. The great teacher inspires.
Training multi-modal large language models (MLLMs) that align with human intentions is a long-standing challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods such as reinforcement learning from human feedback (RLHF). This paper introduces Generative RLHF-V, a novel alignment framework that integrates Generative Reward Models (GRMs) with multi-modal RLHF. We propose a two-stage pipeline: generative reward modeling from multi-modal preference, where RL guides GRMs to actively capture human intention and then predict correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that our framework improves the performance of 4 MLLMs across 7 benchmarks by 18.1% on average, whereas baseline RLHF yields only 5.3%. We further validate the out-of-distribution generalization of GRMs and the scaling trend of grouped comparison. Additionally, we investigate GRMs' susceptibility to reward hacking in an overfitting setting. Our findings indicate that MLLMs adopt self-praising behaviors to deceptively obtain high rewards from GRMs. Notably, this deceptive behavior also misleads MLLM-as-judge benchmarks whose scoring is analogous to that of GRMs. Our code, models, and evaluation details can be found at https://generative-rlhf-v.github.io/.
Learning principles from human preference is a major challenge in AI alignment. In MLLM alignment, traditional RLHF methods learn only scalar scores from preferences. In contrast, our Generative RLHF-V learns principles from preferences and optimizes based on a more comprehensive comparison. Experimental results show that Generative RLHF-V elevates 2B and 3B MLLMs to 7B-level performance across 7 benchmarks. It also advances pretrained models to instruct-model capabilities and enables open-source models to match closed-source experts.
"The mediocre teacher tells. The great teacher inspires."
We propose Generative RLHF-V(ision), as shown in Figure 2, a novel alignment framework integrating a vision GRM with RL fine-tuning. Our pipeline consists of two stages: generative reward modeling from multi-modal preference and RL optimization from grouped comparison. Our reward model extends the self-principled critique tuning (SPCT) pipeline to the vision scenario, training MLLMs as GRMs with RL, using rule-based rewards derived from the annotated ground truth in preference datasets. In contrast to the findings of SPCT, we find that in the multi-modal scenario, letting GRMs autonomously explore principles from preferences yields better generalization than selecting principles from a reference set. Our RL optimization uses GRMs to conduct pairwise competitive scoring over the n responses within each response group, taking each response's average score as its RL optimization objective, as sketched below.
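As a concrete illustration of this grouped comparison, here is a minimal sketch in Python. It assumes a hypothetical `grm_pairwise_score` helper that prompts the trained GRM-V with the image, the question, and a pair of responses and parses one score per response; the prompt template, the score parsing, and the choice to evaluate both orderings of each pair are illustrative assumptions rather than the exact implementation.

```python
from itertools import permutations
from statistics import mean

def grouped_comparison_rewards(image, prompt, responses, grm_pairwise_score):
    """Assign each response the average of its pairwise GRM scores.

    `grm_pairwise_score(image, prompt, resp_a, resp_b)` is a hypothetical
    helper that queries the GRM-V once and returns (score_a, score_b).
    """
    collected = {i: [] for i in range(len(responses))}
    # Ordered pairs: each response is scored against every other response
    # in both positions, which helps average out position bias.
    for i, j in permutations(range(len(responses)), 2):
        score_i, score_j = grm_pairwise_score(image, prompt, responses[i], responses[j])
        collected[i].append(score_i)
        collected[j].append(score_j)
    # Each response's mean score is used as its reward in the RL update (e.g., GRPO).
    return [mean(collected[i]) for i in range(len(responses))]
```

Note that evaluating all ordered pairs costs n(n-1) GRM calls per group, which is the trade-off behind the scaling trend over n discussed later.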
Comparison of our pipelines to traditional ones. For reward modeling, we have the generative RM actively reason about the advantages and disadvantages of two answers and output corresponding scores; if the better response receives a higher score, the GRM gets a positive reward. For RL optimization, we compare responses in pairs within a group to obtain more accurate scores.
The Generative RLHF-V pipeline consists of two main parts: generative reward modeling from reinforcement learning (RL) and RL optimization from grouped comparison. The former refers to training an MLLM through RL as a vision generative reward model, i.e., GRM-V, which actively reasons about the human principle behind two given responses and provides a pair-wise score comparison. The latter leverages this property of GRM-V, collecting multiple responses for a given input and providing more accurate grouped scoring for them.
An example of generative reward modeling from RL. The goal of RL is to make MLLMs assign higher scores to responses that align with human preferences. Through RL optimization, MLLMs can infer the underlying principle behind how humans annotate these binary preferences.
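The rule-based reward used to train the GRM can be sketched as below. This is a minimal illustration assuming the GRM's generated critique ends with score lines such as `Score A: 8` and `Score B: 6`; the tag format and parsing are hypothetical, not the exact training template.

```python
import re

def grm_rule_based_reward(grm_output: str, preferred: str) -> float:
    """Return 1.0 if the GRM's critique scores the human-preferred response higher.

    `preferred` is "A" or "B", taken from the annotated preference dataset.
    The "Score A: <number>" line format is an illustrative assumption.
    """
    scores = {}
    for tag in ("A", "B"):
        match = re.search(rf"Score {tag}:\s*(-?\d+(?:\.\d+)?)", grm_output)
        if match is None:
            return 0.0  # unparsable critiques receive no reward
        scores[tag] = float(match.group(1))
    rejected = "B" if preferred == "A" else "A"
    return 1.0 if scores[preferred] > scores[rejected] else 0.0
```

Optimizing the GRM against this reward with an RL algorithm such as GRPO encourages it to articulate a principle and critique before committing to a pair-wise score.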
An example of RL from grouped comparison. Its advantage lies in utilizing grouped comparisons to achieve more accurate scoring. Response B provides accurate and comprehensive information, thus receiving the highest score; although response A is somewhat arbitrary, it performs accurate image recognition and obtains a higher score than C and D.
Models | w/ GC | w/o GC |
---|---|---|
GRM | 0.41 | 0.38 |
GRM+SFT | 0.37 | 0.33 |
GRM+RL | 0.43 | 0.37 |
GPT-4o (Expert) | 0.48 | 0.46 |
The scoring distribution of the GRM+RL model on MLLM-as-a-Judge's Score task. Panel (a) shows the annotated human scores, panel (b) the GRM+RL scores, and panel (c) its fine-grained score distribution.
Comparison of RM accuracy on OOD discriminative tasks. (P) denotes concatenating the annotation principle from the corresponding preference dataset to the model's output, serving as a hint for inference. All models represented by the bar charts were trained on the Align-Anything dataset. The purple dashed line indicates expert performance.
Model | Feedback | MIA-Bench | LLaVA-Wild | LLaVA-Wilder | MM-Safety | MSS-Bench | MM-Vet | MM-Vet-v2 |
---|---|---|---|---|---|---|---|---|
Qwen2-VL-2B | N/A | 45.31 | 61.46 | 47.18 | 38.12 | 46.98 | 32.12 | 27.15 |
+ DPO | RM | 51.04 (+5.73) | 75.91 (+14.45) | 48.12 (+0.94) | 67.21 (+29.09) | 49.52 (+2.54) | 31.28 (-0.84) | 31.28 (+4.13)
+ PPO | RM | 43.72 (-1.59) | 73.79 (+12.33) | 41.32 (-5.86) | 59.83 (+21.71) | 47.38 (+0.40) | 33.56 (+1.44) | 30.79 (+3.64)
+ GRPO | RM | 44.59 (-0.72) | 69.87 (+8.41) | 39.48 (-7.70) | 69.27 (+31.15) | 48.12 (+1.14) | 29.15 (-2.97) | 31.74 (+4.59)
+ GRPO | GRM | 46.81 (+1.50) | 78.51 (+17.05) | 45.01 (-2.17) | 72.53 (+34.41) | 51.45 (+4.47) | 34.97 (+2.85) | 36.36 (+9.21)
+ GRPO | GRM + SFT | 48.57 (+3.26) | 81.87 (+20.41) | 53.04 (+5.86) | 74.56 (+36.44) | 50.98 (+4.00) | 36.78 (+4.66) | 37.14 (+9.99)
+ GRLHF-V (Ours) | GRM + RL | 53.13 (+7.82) | 92.54 (+31.08) | 62.84 (+15.66) | 80.67 (+42.55) | 53.87 (+6.89) | 41.25 (+9.13) | 45.16 (+18.01)
Qwen2.5-VL-3B-Instruct | N/A | 68.01 | 89.63 | 63.65 | 41.18 | 49.58 | 59.16 | 44.94 |
+ DPO | RM | 74.37 (+6.36) | 91.05 (+1.42) | 66.71 (+3.06) | 75.64 (+34.46) | 52.57 (+2.99) | 55.72 (-3.44) | 45.41 (+0.47)
+ PPO | RM | 72.59 (+4.58) | 93.76 (+4.13) | 65.73 (+2.08) | 71.25 (+30.07) | 50.03 (+0.45) | 60.08 (+0.92) | 48.92 (+3.98)
+ GRPO | RM | 69.82 (+1.81) | 93.94 (+4.31) | 66.41 (+2.76) | 69.83 (+28.65) | 51.96 (+2.38) | 56.92 (-2.24) | 47.55 (+2.61)
+ GRPO | GRM | 75.56 (+7.55) | 92.19 (+2.56) | 67.18 (+3.53) | 75.98 (+34.80) | 57.66 (+8.08) | 57.37 (-1.79) | 49.15 (+4.21)
+ GRPO | GRM + SFT | 74.17 (+6.16) | 96.73 (+7.10) | 71.07 (+7.42) | 72.45 (+31.27) | 58.83 (+9.25) | 59.27 (+0.11) | 51.52 (+6.58)
+ GRLHF-V (Ours) | GRM + RL | 79.67 (+11.66) | 103.41 (+13.78) | 68.46 (+4.81) | 78.88 (+37.70) | 62.33 (+12.75) | 62.18 (+3.02) | 55.18 (+10.24)
Qwen2-VL-7B | N/A | 52.58 | 81.30 | 61.80 | 31.95 | 48.23 | 60.32 | 52.98
+ DPO | RM | 57.01 (+4.43) | 81.49 (+0.19) | 59.75 (-2.05) | 81.59 (+49.64) | 49.87 (+1.64) | 60.98 (+0.66) | 53.09 (+0.11)
+ PPO | RM | 55.76 (+3.18) | 83.06 (+1.76) | 62.23 (+0.43) | 80.87 (+48.92) | 50.08 (+1.85) | 57.83 (-2.49) | 52.12 (-0.86)
+ GRPO | RM | 56.89 (+4.31) | 81.25 (-0.05) | 60.19 (-1.61) | 83.14 (+46.19) | 51.98 (+3.75) | 56.85 (-3.47) | 48.96 (-4.02)
+ GRPO | GRM | 59.72 (+7.14) | 86.12 (+4.82) | 68.30 (+6.50) | 81.42 (+49.47) | 50.21 (+1.98) | 57.98 (-2.34) | 54.49 (+1.51)
+ GRPO | GRM + SFT | 59.87 (+7.29) | 92.91 (+11.61) | 65.67 (+3.87) | 87.27 (+55.32) | 52.75 (+4.52) | 58.79 (-1.53) | 56.39 (+3.41)
+ GRLHF-V (Ours) | GRM + RL | 62.31 (+9.73) | 103.55 (+22.25) | 71.98 (+10.18) | 91.96 (+60.01) | 54.83 (+6.60) | 63.92 (+3.60) | 59.11 (+6.13)
Qwen2.5-VL-7B-Instruct | N/A | 74.26 | 97.05 | 71.56 | 50.67 | 51.96 | 68.32 | 67.23 |
+ DPO | RM | 81.55 (+7.29) | 103.34 (+6.29) | 72.08 (+0.52) | 75.09 (+24.42) | 52.72 (+0.76) | 67.84 (-0.48) | 66.98 (-0.25)
+ PPO | RM | 73.12 (-1.14) | 101.62 (+4.57) | 67.89 (-3.67) | 76.59 (+25.92) | 51.29 (-0.67) | 67.89 (-0.43) | 64.23 (-3.00)
+ GRPO | RM | 75.75 (+1.49) | 101.65 (+4.60) | 68.89 (-2.67) | 68.26 (+17.59) | 52.53 (+0.57) | 66.85 (-1.47) | 67.76 (+0.53)
+ GRPO | GRM | 71.88 (-2.38) | 109.12 (+12.07) | 73.32 (+1.76) | 65.88 (+15.21) | 53.12 (+1.16) | 65.50 (-2.82) | 65.08 (-2.15)
+ GRPO | GRM + SFT | 76.23 (+1.97) | 103.50 (+6.45) | 72.15 (+0.59) | 70.23 (+19.56) | 54.08 (+2.12) | 64.93 (-3.39) | 68.12 (+0.89)
+ GRLHF-V (Ours) | GRM + RL | 79.86 (+5.60) | 113.71 (+16.66) | 76.04 (+4.48) | 74.91 (+24.24) | 59.74 (+7.78) | 72.94 (+4.62) | 71.86 (+4.63)
Scaling trend of RL performance with the number of candidate responses n, where GC denotes grouped comparison. It reveals that integrating GC and RL with the GRM framework significantly enhances RL performance across various settings of n. Moreover, this improvement becomes more pronounced as n increases.
The reward hacking behavior manifested by GRLHF-V and its associated quantitative performance, under conditions of overfitting in both reward modeling and RL training.
Benchmarks | w/ P | w/o P |
---|---|---|
Align-Anything | 0.83 | 0.79 (-0.04)
Beaver-V | 0.73 | 0.78 (+0.05)
LLaVA-Critic | 0.76 | 0.79 (+0.03)
MLLM-as-a-Judge | 0.63 | 0.68 (+0.05)
MIA-Bench | 60.76 | 62.31 (+1.55)
LLaVA-Wild | 99.57 | 103.55 (+3.98)
LLaVA-Wilder | 63.75 | 71.98 (+8.23)
MM-Vet | 62.57 | 63.92 (+1.35)
MM-Vet-v2 | 55.35 | 59.11 (+3.76)
Generative RLHF-V is a novel alignment framework that integrates Generative Reward Models (GRMs) with multi-modal RLHF. It employs a two-stage pipeline: generative reward modeling from multi-modal preference and RL optimization from grouped comparison. This approach enables models to learn underlying principles from human preferences rather than just scalar scores.
Unlike traditional RLHF methods that only learn scalar scores from preferences, Generative RLHF-V enables models to learn the principles behind human preferences. Additionally, it enhances multi-modal RL scoring precision through grouped response comparison rather than individual response evaluation.
Our experimental results show that Generative RLHF-V improves 4 MLLMs' performance across 7 benchmarks by an average of 18.1%, while baseline RLHF methods only achieve a 5.3% improvement. It elevates 2B and 3B MLLMs to 7B performance levels and enables open-source models to match closed-source experts.
Reward hacking occurs when models find ways to maximize rewards without achieving the intended goals. In our research, we discovered that MLLMs can develop self-praising behaviors to deceptively receive high rewards from GRMs. This behavior is also effective in misleading MLLM-as-judge benchmarks, highlighting a significant concern in current evaluation methods.
All our code, models, and evaluation details are available on our GitHub repository and Hugging Face. You can access them via the links at the top of this page.